Abstract

We consider the dynamics of players' strategies in repeated market games,
where the selection of strategies is determined by a learning model. Prior
theoretical analysis and experimental data show that after a large number of
plays the average number of agents who decide to enter, per round of the
game, approaches the market capacity and, after a longer wait, agents are
sorted into two groups: the agents in one group rarely enter the market,
and the agents in the other enter almost all the time. In this paper
we obtain estimates of the characteristic times it takes for both patterns to
emerge in the repeated plays of the game. The estimates are given in terms
of the parameters of the game, assuming that the number of agents is large,
the number of rounds of the game per unit of time is large, and the
characteristic change of the propensity per game is small. Our approach is based on
the analysis of the partial differential equation for the function $f(t,q)$ that
describes the distribution of agents according to their level of propensity to
enter the market, $q$, at time $t$.

Keywords: Market entry games, Reinforcement learning, Drift-Diffusion 
equations 


1. Introduction 

A class of games in the study of social and economic behavior, called
market entry games, describes a conflict situation in which players (agents) in a
group choose between two strategies: enter the market or stay out. An
agent's payoff is determined solely by the number of agents who decide to
enter and the action he or she takes.

The game has a single symmetric mixed equilibrium and a number of
asymmetric pure and mixed equilibria.

Extensive theoretical and experimental work has been done on understanding
which, if any, of the equilibrium strategies emerge when the game
is played repeatedly by agents who, independently of each other, try
to adapt to changing “market conditions.” The situation can be formalized
by introducing into the model individual propensities for agents to play
a particular strategy. The propensities are updated after each round of the
game. They might be determined by the agent's payoffs, as in the basic
reinforcement learning model introduced in Erev and Roth (1998), or might
depend on more information about the game available to agents, as in
fictitious stochastic play, see Fudenberg and Levine (1998).

When the number of players is large, the following patterns of behavior 
are typically observed and predicted by learning models, see for example 
Duffy and Hopkins (2003): 

1. The average number of entries per round of the game quickly approaches 
the market capacity. This is referred to as “aggregate learning”. 

2. In the long run of repeated plays, agents' strategies converge to an
asymmetric pure equilibrium compatible with the market capacity. This is
called “sorting”.

Both phenomena are ubiquitous in market entry games in which
agents use either basic reinforcement learning or fictitious stochastic play.
Aggregate learning is observed to emerge quite quickly, whereas sorting takes
much longer to appear, see Duffy and Hopkins (2003).

The purpose of the present paper is to estimate the time scales of
both phenomena in terms of the number of agents, $N$, the number of games,
$M$, played per unit of time, and the characteristic payoff $h$ per game. We
show that the time of aggregate learning is of the order

$$ \tau_{al} \sim \frac{1}{MNh}, $$

and the time after which sorting becomes noticeable is

$$ \tau_s \sim \frac{1}{MNh^2}. $$
One might expect that formulas like these are valid only in a certain
asymptotic regime, since the game is stochastic. This is indeed the case:
the estimates are derived under the conditions that $N$ and $M$ are large, $h$ is
small, and $MNh$ is finite. Interestingly, the estimates are the same for both
models, basic reinforcement learning and fictitious stochastic play.
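
To get a feel for the separation between the two estimates, the following minimal Python sketch evaluates them for sample parameter values. The function name `time_scales` and the numbers are illustrative assumptions, not taken from the paper, and since the formulas are order-of-magnitude estimates, constant factors are ignored.

```python
def time_scales(N, M, h):
    """Order-of-magnitude estimates of the two characteristic times.

    N: number of agents, M: rounds per unit of time, h: characteristic payoff.
    Constant factors are ignored; only the scaling is meaningful.
    """
    rate = M * N * h            # the combination MNh, assumed to stay finite
    tau_al = 1.0 / rate         # aggregate learning: 1/(MNh)
    tau_s = 1.0 / (rate * h)    # sorting: 1/(MNh^2)
    return tau_al, tau_s


# Illustrative values (assumptions chosen only for this example).
tau_al, tau_s = time_scales(N=1000, M=100, h=0.001)
print(f"aggregate learning ~ {tau_al:g}, sorting ~ {tau_s:g}")
print(f"ratio tau_s / tau_al ~ {tau_s / tau_al:g}")
```

The ratio of the two estimates is $1/h$, so the smaller the per-game change of propensity, the wider the gap between aggregate learning and sorting.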

Our approach is based on the derivation of a partial differential equation
for the distribution of agents along the propensity line. The equation is
a drift-diffusion equation, with the drift velocity proportional to $(\tau_{al})^{-1}$ and
the diffusion coefficient proportional to $(\tau_s)^{-1}$.


2. The game and adaptive learning models 


There are $N$ agents participating in the game. Let $\delta^i$ denote the indicator
function for agent $i$ to enter the game: $\delta^i = 1$ if the agent enters, and $\delta^i = 0$
otherwise. Let $c \in \mathbb{N}$ be the capacity of the market, which we take to be
an integer for simplicity of presentation. Let $m$ be the number of
agents who enter the market, $h$ be the characteristic payoff, and $v > 0$ be
the compensation for participating in the game. Then the payoff to agent $i$
can be defined, for example, as

$$ \begin{cases} v & \text{if } \delta^i = 0, \\ v + h(c - m) & \text{if } \delta^i = 1, \end{cases} $$

see Erev and Rapoport (1998).

In basic reinforcement learning, due to Erev and Roth (1998), the
game is played repeatedly, and the state of agent $i$ is defined by the
propensities to enter and to stay out after the $n$-th round of the game:

$$ (q^i_{1,n}, q^i_{2,n}) \in \mathbb{R}_+^2. $$

The probability that agent $i$ uses in deciding to enter is given by

$$ p^i_n = \frac{q^i_{1,n}}{q^i_{1,n} + q^i_{2,n}}. $$

To reduce the number of parameters and simplify the presentation,
let us assume that $v = 0$. In this case the propensity to stay out does not
change in time: $q^i_{2,n} = q^i_{2,0}$. Furthermore, let us assume that for some
$q_0 \in \mathbb{R}$ and all $i = 1, \dots, N$ the propensities to stay out are the same:

$$ q^i_{2,n} = q_0 > 0. $$
Consequently, we need to consider only the propensity to enter the market,
which we denote by $q^i_n$. Under these assumptions the probability for agent $i$
to enter the market equals

$$ p^i_n = \frac{q^i_n}{q^i_n + q_0}. $$

To use this formula one has to make sure that the propensities $q^i_n$ stay
nonnegative. In fact, the explicit formula for the probability function will not be
needed in our analysis, and we opt to use a generic probability function

$$ p^i_n = p(q^i_n), \tag{1} $$

where $p = p(q) \in [0,1]$ is a strictly increasing, twice differentiable function
such that

$$ p(-\infty) = 1 - p(+\infty) = 0. $$

In this way, the nonnegativity of the propensities is not required.

We consider two models of learning. In the model of basic reinforcement
(with $v = 0$) of Erev and Roth (1998), the propensity is increased or decreased
by the amount of the payoff received in the $(n+1)$-th game:

$$ q^i_{n+1} = q^i_n + h\,\delta^i_n (c - m_n), \tag{2} $$

where $\delta^i_n$ is the indicator function of the action of player $i$ in the $n$-th game, and
$m_n$ is the number of agents who enter the game.

In the second model, the agents have more information about the game, which
is reflected by the fact that the propensity in (2) is also changed by the amount of
the payoff agent $i$ would get if he or she played the opposite strategy:

$$ q^i_{n+1} = q^i_n + h(c - m_n) - h(1 - \delta^i_n), \tag{3} $$

see Duffy and Hopkins (2005).
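
The two updating rules can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the logistic form of $p(q)$ is only one admissible choice of the generic probability function in (1), and the names `entry_probability`, `play_round` and the rule labels are hypothetical.

```python
import math
import random


def entry_probability(q):
    """Assumed logistic p(q): strictly increasing, p(-inf) = 0, p(+inf) = 1."""
    if q >= 0:
        return 1.0 / (1.0 + math.exp(-q))
    e = math.exp(q)
    return e / (1.0 + e)


def play_round(q, c, h, rule, rng):
    """Play one round; return (updated propensities, number of entrants).

    rule="reinforcement" applies update (2); rule="fictitious" applies update (3).
    """
    # Each agent enters independently with probability p(q_i).
    delta = [1 if rng.random() < entry_probability(qi) else 0 for qi in q]
    m = sum(delta)
    if rule == "reinforcement":
        # Only entrants change their propensity, by the realized payoff h(c - m).
        new_q = [qi + h * di * (c - m) for qi, di in zip(q, delta)]
    elif rule == "fictitious":
        # Non-entrants are updated by the payoff they would have received.
        new_q = [qi + h * (c - m) - h * (1 - di) for qi, di in zip(q, delta)]
    else:
        raise ValueError(f"unknown rule: {rule}")
    return new_q, m


rng = random.Random(0)
# Two near-certain entrants and two near-certain non-entrants.
q0 = [100.0, 100.0, -100.0, -100.0]
q_r, m_r = play_round(q0, c=1, h=0.1, rule="reinforcement", rng=rng)
q_f, m_f = play_round(q0, c=1, h=0.1, rule="fictitious", rng=rng)
print(m_r, q_r)
print(m_f, q_f)
```

In this example the market is over capacity, so under rule (2) only the entrants are pushed down, while under rule (3) every agent's propensity decreases.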

2.1. The method of the distribution function 

Using the theory of stochastic approximation of Benaïm (1999), Duffy
and Hopkins (2005) prove that in repeated market entry games with either
basic reinforcement learning or fictitious stochastic play, under rather
generic conditions, the agents' strategies converge with probability one to an
asymmetric pure strategy equilibrium. In that approach, models (2) and (3)
are considered as dynamical systems of size $N$ that describe the individual
propensities of all agents as they evolve under the stochastic updating rule.
It is quite remarkable that the asymptotic behavior can be established for
such complicated systems.

In this paper we take a different approach, which is based on the derivation
of the kinetic drift-diffusion equation for the distribution of the agents
according to their propensity levels.

Define the time step $\tau = 1/M$, where $M$ is the number of rounds of the
game per unit of time. The game takes place at times

$$ t_n = n\tau, \quad n = 1, 2, 3, \dots $$

If the initial propensities $q^i_0$ are discretized to the mesh $\{q_k = kh\}$, $k \in \mathbb{Z}$,
then for all times $t_n$ the propensities $q^i_n$ belong to the same mesh.

We are interested in the function $f(t_n,q)$, defined as the
proportion of all agents that have propensity $q = q_k$, $k \in \mathbb{Z}$, at time $t_n$.
That is, $f(t_n,q)$ is the PMF (probability mass function) of the propensity of a
randomly selected agent. We may write

$$ f(t_n,q) = \sum_k a^n_k\, \delta(q - q_k), $$

where $\delta(q - q_k)$ is the delta mass supported at $q_k$, and $a^n_k$ are non-negative
numbers summing over $k$ to 1. They are defined as

$$ a^n_k = \frac{\#\ \text{of agents at time } t_n \text{ with propensity } q_k}{N}. $$
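
As a small illustration of this definition, the weights $a^n_k$ can be read off a list of propensities by counting agents per mesh point. The sketch below assumes the propensities already lie on the mesh $\{kh\}$ up to rounding error; the function name `propensity_pmf` is hypothetical.

```python
from collections import Counter


def propensity_pmf(q, h):
    """Return {k: a_k}, the fraction of agents whose propensity equals q_k = k*h."""
    # round() absorbs floating-point error in q_i / h for values on the mesh.
    counts = Counter(round(qi / h) for qi in q)
    n_agents = len(q)
    return {k: cnt / n_agents for k, cnt in sorted(counts.items())}


# Four agents with propensities on the mesh of step h = 0.1.
pmf = propensity_pmf([0.0, 0.1, 0.1, 0.2], h=0.1)
print(pmf)
```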


We are interested in two integrals of $f$. The first,

$$ a(t_n) = \int p(q) f(t_n,q)\,dq, \tag{4} $$

is the average fraction of agents entering the market at the $(n+1)$-th
round, and the second,

$$ b(t_n) = \int p(q)(1 - p(q)) f(t_n,q)\,dq > 0, \tag{5} $$

we call the coefficient of sorting. The sorting of the population of agents
into two groups is expressed by the smallness of $b(t)$, since it implies that
$f(t,q)$ is supported either on large negative values of $q$ (rarely enter the market) or
on large positive values (enter almost all the time).
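
For a finite population, the integrals (4) and (5) reduce to averages over agents. The following sketch computes both for an unsorted and a sorted population; the logistic $p(q)$ and the function names are assumptions made only for this illustration.

```python
import math


def entry_probability(q):
    """Assumed logistic p(q): strictly increasing, p(-inf) = 0, p(+inf) = 1."""
    if q >= 0:
        return 1.0 / (1.0 + math.exp(-q))
    e = math.exp(q)
    return e / (1.0 + e)


def entry_rate_and_sorting(q):
    """Return (a, b): the population averages of p and of p(1 - p)."""
    probs = [entry_probability(qi) for qi in q]
    n = len(probs)
    a = sum(probs) / n
    b = sum(p * (1.0 - p) for p in probs) / n
    return a, b


# Everyone mixes 50/50: same entry rate as below, but no sorting.
a_mixed, b_mixed = entry_rate_and_sorting([0.0, 0.0, 0.0, 0.0])
# Half the agents almost always enter, half almost never do.
a_sorted, b_sorted = entry_rate_and_sorting([50.0, 50.0, -50.0, -50.0])
print(a_mixed, b_mixed)
print(a_sorted, b_sorted)
```

Both populations have the same entry rate $a$, which is why $a$ alone cannot distinguish aggregate learning from sorting and the second integral $b$ is needed.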
We investigate the conditions under which $a(t)$ approaches the fraction of
the market capacity $c/N$ and $b(t)$ converges to zero. Thus, when working with
the distribution function, we cannot say to which particular equilibrium the
system converges, but we still have enough information to say that the system
does approach an equilibrium and that the equilibrium is a pure asymmetric one.

Let us also mention that studying distribution functions, instead of the
dynamics of individual particles (agents), is a classical approach in science,
with examples ranging from the Boltzmann equation of gas dynamics
and the diffusion processes describing Brownian motion to equations for the
distribution of commodities in social and economic studies, see Feller (1957),
Ch. XIV, and Pareschi & Toscani (2014).


2.2. Time scales 


We will show in Appendix A that in the asymptotic regime

$$ N \to \infty, \quad h \to 0, \quad \tau \to 0, \quad \frac{Nh}{\tau} \to r, $$

for some $r \in \mathbb{R}_+$, the density $f(t,q)$ of basic reinforcement learning satisfies
the following nonlinear drift-diffusion equation:

$$ \partial_t f + r(\kappa - a)\,\partial_q (p(q) f) - \frac{1}{2}\left(rNh(\kappa - a)^2 + rhb\right)\partial_q^2 (p(q) f) = 0, \tag{6} $$

with

$$ a = \int p(q) f(t,q)\,dq, \qquad b = \int p(1 - p) f(t,q)\,dq, $$

where $\kappa = c/N > 0$ is the capacity of the market expressed as a fraction of
the total population of agents.

Equation (6) has a rather simple structure.

It consists of diffusion of the density, with the diffusion coefficient

$$ \mu(t,q) = \frac{1}{2}\left(rNh(\kappa - a)^2 + rhb\right) p(q) > 0, $$

and drift (transport) of the density $f$ with the velocity

$$ v(t,q) = r(\kappa - a(t))\,p(q) - \frac{1}{2}\left(rNh(\kappa - a)^2 + rhb\right) p'(q). $$

In this notation equation (6) takes the form

$$ \partial_t f + \partial_q (v f) - \partial_q (\mu\,\partial_q f) = 0. $$
We will show in Appendix B that in the long run $a(t)$ approaches $\kappa$ and
the characteristic time of this convergence is

$$ \tau_{al} \sim \frac{1}{r}. \tag{7} $$

On the other hand, $b(t)$ approaches zero with the characteristic time

$$ \tau_s \sim \frac{1}{rh}. \tag{8} $$

Note in particular that sorting is slow compared with aggregate learning:

$$ \tau_s \gg \tau_{al}. $$

Equation (6) provides a convenient description of the processes governing
the dynamics of repeated games. Starting from the initial distribution of
propensities, the system quickly moves towards the state of aggregate
learning by punishing or rewarding all agents for deviations from the market
capacity $\kappa$. This is expressed in (6) by the drift velocity being proportional
to $(\kappa - a(t))$.

The stochastic nature of the decisions that agents make, and their
independence, result in a tendency of the individual propensities to spread out
over the propensity space, which is expressed by the diffusion part of
equation (6). This results in the convergence $f(t,q) \to 0$ for every $q$,
meaning that fewer agents use mixed strategies to play the game, which leads
to sorting. Sorting requires a significant amount of time since the diffusion
coefficient is small.
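
The fast relaxation of the entry rate towards $\kappa$ can be observed directly in a small agent-based simulation of rule (2). The sketch below is illustrative only: the logistic $p(q)$, the parameter values, the common initial propensity and the function names are all assumptions, and time is measured in rounds rather than in units of $\tau$.

```python
import math
import random


def entry_probability(q):
    """Assumed logistic p(q): strictly increasing, p(-inf) = 0, p(+inf) = 1."""
    if q >= 0:
        return 1.0 / (1.0 + math.exp(-q))
    e = math.exp(q)
    return e / (1.0 + e)


def simulate(N, c, h, q_init, rounds, seed=0):
    """Run the reinforcement rule (2); return the entry rate a before each round."""
    rng = random.Random(seed)
    q = [q_init] * N
    history = []
    for _ in range(rounds):
        probs = [entry_probability(qi) for qi in q]
        history.append(sum(probs) / N)          # a = population average of p(q)
        delta = [1 if rng.random() < pi else 0 for pi in probs]
        m = sum(delta)                          # number of entrants this round
        q = [qi + h * di * (c - m) for qi, di in zip(q, delta)]
    return history


kappa = 0.5                                     # capacity fraction c / N
history = simulate(N=200, c=100, h=0.001, q_init=2.0, rounds=500)
print(f"a at start: {history[0]:.3f}")
print(f"a near the end: {history[-1]:.3f} (kappa = {kappa})")
```

With these values $Nh = 0.2$, so the drift acts over tens of rounds, whereas the sorting estimate $1/(Nh^2)$ corresponds to several thousand rounds; a run this short is therefore expected to show aggregate learning but little sorting.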


2.3. Time scales for the second model of learning 

For the model of learning (3) the density $f$ satisfies a similar equation,

$$ \partial_t f + r(\kappa - a)\,\partial_q f - \frac{1}{2}\left(rNh(\kappa - a)^2 + rhb\right)\partial_q^2 f = 0, \tag{9} $$

which leads to the same estimates (7) and (8) for the time scales.

Note the difference between equations (6) and (9): the drift velocity
and the diffusion coefficient in the model of fictitious play are homogeneous
in (independent of) the propensity $q$, whereas in (6) both are proportional to the
probability $p(q)$. The latter case indicates that the intensity of the drift and
diffusion is larger for agents who have a higher probability (and thus
propensity) to enter. This, in fact, is expected, as the only way to change the
propensity of an agent in the reinforcement model is for her to enter the
market.

This difference results in sorting being somewhat faster for agents
with low propensity in the model (9), once the system gets close to the
equilibrium.

Appendix A. The drift-diffusion equation 

In this section we give a heuristic derivation of the drift-diffusion equations
(6) and (9). Consider a repeated market entry game with basic reinforcement
learning (1), (2).

Let $f(t_n,q)$ be the PMF of the propensity of a randomly selected agent,
as in Section 2.1, and let $X_n$ be a random variable with this PMF.

At time $t = t_n$ we select an agent at random and observe her propensity
level: $q = X_n$. The probability that the agent will enter the market at the next
round is $p = p(q)$. The value $q$ belongs to the mesh $\{kh\}$, and we denote the
corresponding mesh number by

$$ \bar{k} = \frac{q}{h} \in \mathbb{Z}. $$

If $k \neq \bar{k}$, there are $N f(t_n, q_k) = N a^n_k$ agents that have propensity $q_k$;
if $k = \bar{k}$, there are $N a^n_{\bar{k}} - 1$ agents (besides the one we selected) that have
propensity $q$.

Since each agent behaves independently of the others, the number of agents
among $N a^n_k$ (or $N a^n_{\bar{k}} - 1$) who will enter at the next round is a binomial
random variable, which we denote $X_k$ (or $X_{\bar{k}}$):

$$ X_k \sim B(N a^n_k, p_k), \quad p_k = p(q_k), \quad k \neq \bar{k}, $$

and

$$ X_{\bar{k}} \sim B(N a^n_{\bar{k}} - 1, p_{\bar{k}}). $$

Here $B(n,p)$ stands for the binomial distribution of the number of successes in $n$
Bernoulli trials with probability of success $p$.

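We compute the expectation and the variance of $X_k$,

$$ E[X_k] = N a^n_k p_k, \qquad V(X_k) = N a^n_k p_k (1 - p_k), \qquad k \neq \bar{k}, $$

and

$$ E[X_{\bar{k}}] = (N a^n_{\bar{k}} - 1) p_{\bar{k}}, \qquad V(X_{\bar{k}}) = (N a^n_{\bar{k}} - 1) p_{\bar{k}} (1 - p_{\bar{k}}). $$

Let $\hat{m}_n$ be the total number of agents who enter the market, not counting
the selected one, i.e.,

$$ \hat{m}_n = \sum_k X_k. $$

Then, using the pairwise independence of the $X_k$'s,

$$ E[\hat{m}_n] = N \int p(q) f(t_n,q)\,dq - p_{\bar{k}} $$

and

$$ V(\hat{m}_n) = N \int p(q)(1 - p(q)) f(t_n,q)\,dq - p_{\bar{k}}(1 - p_{\bar{k}}). $$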
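
These two formulas can be checked numerically on a small population by building the exact distribution of the number of entrants among the other agents. In the sketch below the entry probabilities are hypothetical values, the population sums $\sum_i p_i$ and $\sum_i p_i(1-p_i)$ play the role of $N\int p f\,dq$ and $N\int p(1-p) f\,dq$, and the helper names are assumptions.

```python
def count_pmf(probs):
    """Exact PMF of the number of successes among independent Bernoulli trials."""
    pmf = [1.0]
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, w in enumerate(pmf):
            nxt[k] += w * (1.0 - p)      # this agent stays out
            nxt[k + 1] += w * p          # this agent enters
        pmf = nxt
    return pmf


def mean_and_variance(pmf):
    mean = sum(k * w for k, w in enumerate(pmf))
    var = sum((k - mean) ** 2 * w for k, w in enumerate(pmf))
    return mean, var


# Hypothetical entry probabilities of N = 5 agents; agent 0 is the selected one.
probs = [0.9, 0.2, 0.5, 0.7, 0.1]
selected = 0
others = probs[:selected] + probs[selected + 1:]

exact_mean, exact_var = mean_and_variance(count_pmf(others))
formula_mean = sum(probs) - probs[selected]
formula_var = (sum(p * (1.0 - p) for p in probs)
               - probs[selected] * (1.0 - probs[selected]))
print(exact_mean, formula_mean)
print(exact_var, formula_var)
```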

Let $\delta_n$ be the indicator function for the selected agent to enter the
market, which is, by design of the model, independent of $\hat{m}_n$. For this random
variable,

$$ \mathrm{Prob}(\delta_n = 1) = 1 - \mathrm{Prob}(\delta_n = 0) = p_{\bar{k}}. $$

At the next time, $t = t_{n+1}$, the propensity of the selected player to enter the
market equals, by (2),

$$ X_{n+1} = X_n + \delta_n h (c - \hat{m}_n - 1). \tag{A.1} $$

We will assume that $X_{n+1}$ is a good approximation of the propensity to enter
the market of a randomly selected agent at time $t = t_{n+1}$. In other words, the
distribution of $X_{n+1}$ is given by $f(t_{n+1},q)$.

This assumption suffices to derive the equation for $f(t,q)$. Let $\phi$ be
a test function and compute

$$ \int \phi(q) f(t_{n+1},q)\,dq = E[\phi(X_{n+1})] = E[\phi(X_n + \delta_n h(c - \hat{m}_n - 1))] $$

$$ = \int \Big( p(q)\,E[\phi(q + h(c - \hat{m}_n - 1))] + (1 - p(q))\,\phi(q) \Big) f(t_n,q)\,dq. \tag{A.2} $$

Using the Taylor expansion

$$ \phi(q + z) = \phi(q) + \phi'(q) z + \frac{\phi''(q)}{2} z^2 + o(z^2), $$
we obtain from the previous computation:

$$ \int \phi(q)\,\frac{f(t_{n+1},q) - f(t_n,q)}{\tau}\,dq = \frac{h}{\tau} \int \phi'(q) p(q)\,E[c - \hat{m}_n - 1]\, f(t_n,q)\,dq $$

$$ + \frac{h^2}{2\tau} \int \phi''(q) p(q)\,E[(c - \hat{m}_n - 1)^2]\, f(t_n,q)\,dq + \dots \tag{A.3} $$

Next we compute

$$ E[c - \hat{m}_n - 1] = c - N \int p(q) f(t_n,q)\,dq + p(q) - 1 \approx N \left[ c/N - \int p f(t_n,q)\,dq \right], \tag{A.4} $$

where we assumed that $N$ is large. In a similar way,

$$ E[(c - \hat{m}_n - 1)^2] = (c - E[\hat{m}_n] - 1)^2 + V(\hat{m}_n) \approx N^2 \left( c/N - \int p f(t_n,q)\,dq \right)^2 + N \int p(1 - p) f(t_n,q)\,dq. \tag{A.5} $$


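Finally, we assume that $\tau$ and $Nh$ are small and that $r = \frac{Nh}{\tau}$ and $\kappa = \frac{c}{N}$ are finite.
Returning to (A.3) and retaining only the leading-order terms, we obtain an
integral equation

$$ \int \phi(q)\,\partial_t f(t,q)\,dq = r \int \phi'(q) p(q) \left( \kappa - \int p f\,dq \right) f\,dq $$

$$ + \frac{rNh}{2} \int \phi''(q) p(q) \left[ \left( \kappa - \int p f\,dq \right)^2 + \frac{1}{N} \int p(1 - p) f\,dq \right] f\,dq. \tag{A.6} $$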


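Since the equation holds for an arbitrary test function $\phi$, we obtain a partial
differential equation for $f$:

$$ \partial_t f + r(\kappa - a)\,\partial_q (p f) - \frac{1}{2}\left(rNh(\kappa - a)^2 + rhb\right)\partial_q^2 (p f) = 0, $$

where

$$ a = \int p(q) f(t,q)\,dq, \qquad b = \int p(q)(1 - p(q)) f(t,q)\,dq, \qquad \kappa = \frac{c}{N}, \qquad r = \frac{Nh}{\tau}. $$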
For the second model of learning (3), instead of (A.1), the updating rule
prescribes (using $m_n = \hat{m}_n + \delta_n$ in (3))

$$ X_{n+1} = X_n + h(c - \hat{m}_n - 1). $$

Repeating the arguments of the previous case, we obtain the equation

$$ \partial_t f + r(\kappa - a)\,\partial_q f - \frac{1}{2}\left(rNh(\kappa - a)^2 + rhb\right)\partial_q^2 f = 0, $$

where $a$, $b$ are the same as above.

Appendix B. Two time scales 

Formulas (7) and (8) are obtained by taking moments of equation (6).
We multiply the equation by $p(q)$ and integrate in $q$. Assuming that $f(t,q)$
decays fast enough at infinity, we find an ODE for the average entry rate
$a(t)$:

$$ \frac{da}{dt} = r \left( \int p' p f(t,q)\,dq \right) (\kappa - a) + \frac{1}{2}\left(rNh(\kappa - a)^2 + rhb\right) \int p'' p f\,dq. \tag{B.1} $$

The ODE is still not closed, as it depends on integrals of $f$. However, since
$p(q), p'(q) > 0$ and

$$ \int p' p f\,dq \leq \max_q \{ p'(q) p(q) \}, $$

the first integral is of order 1 and we replace it with a suitable positive constant $c(p)$.
Moreover, since $Nh \sim \tau$ and $h$ are small, the contribution of the last term
in (B.1) can be ignored, and we obtain the equation

$$ \frac{da}{dt} = r c(p) (\kappa - a), \tag{B.2} $$

and

$$ a(t) = \kappa + (a(0) - \kappa) e^{-c(p) r t}. $$

Thus, the ratio $(a(t) - \kappa)/(a(0) - \kappa)$ decreases to zero with the characteristic
time $\tau_{al} = 1/r$.
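
As a sanity check on the closed-form solution of (B.2), one can integrate the ODE numerically and compare. The sketch below uses the explicit Euler method, chosen here only for simplicity; the parameter values and function names are illustrative assumptions.

```python
import math


def euler_entry_rate(a0, kappa, r, c_p, t_end, steps):
    """Integrate da/dt = r * c_p * (kappa - a) with the explicit Euler method."""
    dt = t_end / steps
    a = a0
    for _ in range(steps):
        a += dt * r * c_p * (kappa - a)
    return a


def exact_entry_rate(a0, kappa, r, c_p, t):
    """Closed-form solution a(t) = kappa + (a(0) - kappa) * exp(-c_p * r * t)."""
    return kappa + (a0 - kappa) * math.exp(-c_p * r * t)


a0, kappa, r, c_p = 0.9, 0.5, 100.0, 0.25       # illustrative values
t_char = 1.0 / (c_p * r)                         # one e-folding time
a_num = euler_entry_rate(a0, kappa, r, c_p, t_char, steps=10_000)
a_exact = exact_entry_rate(a0, kappa, r, c_p, t_char)
ratio = (a_exact - kappa) / (a0 - kappa)         # deviation left after t_char
print(a_num, a_exact, ratio)
```

After one e-folding time $1/(c(p)\,r)$ the deviation from $\kappa$ has dropped by a factor of $e$, which is the sense in which $1/r$ sets the scale of aggregate learning up to the constant $c(p)$.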

It is harder to obtain an analytical expression for the coefficient of sorting.
Qualitatively, sorting occurs due to the diffusion of the density $f$. We
proceed heuristically, postulating that the rate of decrease of the coefficient
of sorting is proportional to the diffusion

$$ \left( rNh(\kappa - a(t))^2 + rhb(t) \right)/2. $$

Since this quantity is asymptotically smaller than the drift $r(\kappa - a(t))$,
$a(t)$ is already close to $\kappa$ when diffusion becomes relevant, and the
diffusion is of the order $rhb(t)/2$. Thus, we obtain that

$$ \frac{db}{dt} \sim -\frac{rh}{2}\, b, $$

and $b(t) \sim b(0) e^{-rht/2}$. This formula implies that the ratio $b(t)/b(0)$ decreases
to zero with the characteristic time $\tau_s = 2/(rh)$.